[KYUUBI #5136][Bug] Spark App may hang forever if FinalStageResourceManager killed all executors #5141
zhouyifan279 wants to merge 5 commits into apache:master
Conversation
```diff
-      adjustTargetNumExecutors = true,
+      adjustTargetNumExecutors = false,
       countFailures = false,
       force = false)
```
Shall we call client.requestTotalExecutors to adjust the target executors if targetExecutors < draTargetExecutors after killing executors?
We might face an issue similar to apache/spark#19048.
IMO, we just need to change adjustTargetNumExecutors from true to false and call client.requestTotalExecutors to ensure the final target executor number is as expected.
Agree. But building the parameters of client.requestTotalExecutors involves many details,
so I prefer to wait for ExecutorAllocationManager to call client.requestTotalExecutors itself.
As long as we do not call killExecutors with adjustTargetNumExecutors = true, ExecutorAllocationManager should be able to manage the target executor number correctly.
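For context, a minimal sketch of the resulting call: the parameter names match the diff above and follow the killExecutors signature on Spark 3.x's ExecutorAllocationClient, while executorsToKill and executorAllocationClient are assumed placeholders for the values FinalStageResourceManager works with.

```scala
// Kill the surplus executors before the final write stage, but leave the
// target executor number untouched so ExecutorAllocationManager can still
// raise it later through client.requestTotalExecutors.
val killed: Seq[String] = executorAllocationClient.killExecutors(
  executorIds = executorsToKill,     // assumed: ids chosen by FinalStageResourceManager
  adjustTargetNumExecutors = false,  // the fix: do not pin the target to 0
  countFailures = false,
  force = false)
```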
```scala
if (draTargetExecutors <= targetExecutors) {
  // Ensure target executor number has been updated in cluster manager client
  executorAllocationClient.requestExecutors(0)
}
```
I'm not sure I get it correctly. It seems we should do nothing if the DRA target executor number is smaller than our target executor number. If that happens, the executors are pending to be killed.
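A hedged note on why the hunk above can use requestExecutors(0) as a sync: in Spark's CoarseGrainedSchedulerBackend, requestExecutors adds the delta to the stored desired total and then re-sends the whole total to the cluster manager, so a delta of 0 re-syncs the target without asking for anything new. A simplified, self-contained model of that behavior (not the actual Spark source; names abbreviated for illustration):

```scala
// Simplified model of Spark's CoarseGrainedSchedulerBackend.requestExecutors.
object SchedulerBackendModel {
  private var requestedTotalExecutors = 0

  // Stand-in for the push of the desired total to the cluster manager
  // (e.g. YarnAllocator in YARN mode).
  private def doRequestTotalExecutors(total: Int): Boolean = true

  // A delta of 0 leaves the total unchanged but still re-sends it, which is
  // exactly the "sync" the hunk above relies on.
  def requestExecutors(numAdditionalExecutors: Int): Boolean = {
    requestedTotalExecutors += numAdditionalExecutors
    doRequestTotalExecutors(requestedTotalExecutors)
  }
}
```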
Codecov Report

```diff
@@            Coverage Diff            @@
##           master    #5141     +/-   ##
=========================================
  Coverage    0.00%     0.00%
=========================================
  Files         566       566
  Lines       31590     31654      +64
  Branches     4120      4124       +4
=========================================
- Misses      31590     31654      +64
```

see 20 files with indirect coverage changes
zhouyifan279 force-pushed the branch from 92aaa08 to 9dcbc78 ([KYUUBI #5136][Bug] Spark App may hang forever if FinalStageResourceManager killed all executors)
```diff
     spark: '3.3'
     spark-archive: '-Dspark.archive.mirror=https://archive.apache.org/dist/spark/spark-3.1.3 -Dspark.archive.name=spark-3.1.3-bin-hadoop3.2.tgz -Pzookeeper-3.6'
-    exclude-tags: '-Dmaven.plugin.scalatest.exclude.tags=org.scalatest.tags.Slow,org.apache.kyuubi.tags.DeltaTest,org.apache.kyuubi.tags.IcebergTest'
+    exclude-tags: '-Dmaven.plugin.scalatest.exclude.tags=org.scalatest.tags.Slow,org.apache.kyuubi.tags.DeltaTest,org.apache.kyuubi.tags.IcebergTest,org.apache.kyuubi.tags.SparkLocalClusterTest'
```
A task serialized by the Spark 3.3 driver cannot be deserialized by a Spark 3.1.3 executor, so SparkLocalClusterTest has to be excluded in this cross-version job.
cc @ulysses-you @pan3793 Ready for review.
```scala
eventually(timeout(Span(10, Minutes))) {
  sql(
    "CREATE TABLE final_stage AS SELECT id, count(*) as num FROM (SELECT 0 id) GROUP BY id")
}
```
Shall we call getAdjustedTargetExecutors at the end to make sure the target executor number is 1?
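A minimal sketch of what that extra assertion could look like; getAdjustedTargetExecutors is the helper named in the comment above, while its exact signature and return value are assumptions here:

```scala
// Hypothetical follow-up check: after FinalStageResourceManager has killed the
// surplus executors, the adjusted target tracked for the final stage should
// converge to a single executor.
eventually(timeout(Span(10, Minutes))) {
  assert(getAdjustedTargetExecutors() === 1)
}
```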
Thanks, merged to master
Why are the changes needed?
In rare cases, a Spark stage hangs forever when spark.sql.finalWriteStage.eagerlyKillExecutors.enabled is true.
The bug occurs if two conditions are met at the same time:
1. FinalStageResourceManager kills all executors with adjustTargetNumExecutors = true. The target executor number in YarnAllocator is then set to 0, and no more executors will be launched.
2. The target computed by ExecutorAllocationManager does not change afterwards, so ExecutorAllocationManager will not sync the target executor number to YarnAllocator (see the sketch after this list).
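To make the hang mechanism concrete, here is a heavily simplified model of the sync-on-change behavior in Spark's ExecutorAllocationManager.updateAndSyncNumExecutorsTarget; all names and the call shape below are illustrative paraphrases, not the actual source:

```scala
// Simplified model: ExecutorAllocationManager only pushes its target to the
// cluster manager when that target changes.
class AllocationManagerModel(syncTargetToClusterManager: Int => Unit) {
  private var numExecutorsTarget = 1

  def updateAndSyncNumExecutorsTarget(maxNeeded: Int): Unit = {
    val newTarget = math.max(maxNeeded, 1)
    if (newTarget != numExecutorsTarget) {
      numExecutorsTarget = newTarget
      syncTargetToClusterManager(newTarget) // stands in for requestTotalExecutors
    }
    // If the target did not change, nothing is sent: YarnAllocator keeps the 0
    // that killExecutors(adjustTargetNumExecutors = true) pinned earlier, and
    // the final stage waits forever for an executor that will never launch.
  }
}
```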
How was this patch tested?
FinalStageResourceManagerSuite